The course is about learning to use R for statistics, with an emphasis on using it together with GitHub. My objective for attending the course is to refresh my knowledge of R and of statistics in general. The course has been on the "TODO" list of my PhD studies for a while.
I think the R for Health Data Science book works well as an introduction, although, as I'm not a complete novice, I didn't read the chapters very thoroughly - there's a lot to take in in a short time (I missed the exercises in chapter 3, but not chapter 4). Perhaps the book works more as a sampler now and as a reference later.
There have been quite a few new things since I last used R, like the
tidyverse pipe. I’m glad there is a possibility of using the
“language-agnostic standard” of using = for assignment
instead of (or in addition to) <-, which I’ve always
found odd. A lot of things are similar to features in pandas in Python
data analysis (not surprising, as pandas has taken quite a bit of
inspiration from R).
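As a small illustration of both points - this is just a sketch; the native pipe requires R 4.1 or later:

```r
# `=` and `<-` both perform assignment at the top level
x = c(1, 2, 3, 4)
y <- c(1, 2, 3, 4)
identical(x, y)   # TRUE

# The native pipe |> (like the tidyverse %>%) passes the left-hand
# side as the first argument of the right-hand side
piped <- x |> mean()
plain <- mean(x)
piped == plain    # TRUE
```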
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Mon Nov 27 17:21:00 2023"
GitHub repository: https://github.com/dustedmtl/IODS-project
Data analysis
date()
## [1] "Mon Nov 27 17:21:00 2023"
The dataset that is read in contains averaged survey scores for three learning approaches (deep, surface and strategic), along with student attitude, exam points, age, and gender. There are 166 students.
lr2 <- read.table('data/learning2014.csv', sep=',', header = T)
dim(lr2)
## [1] 166 7
summary(lr2)
## gender age attitude points
## Length:166 Min. :17.00 Min. :14.00 Min. : 7.00
## Class :character 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:19.00
## Mode :character Median :22.00 Median :32.00 Median :23.00
## Mean :25.51 Mean :31.43 Mean :22.72
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:27.75
## Max. :55.00 Max. :50.00 Max. :33.00
## deep surf stra
## Min. :1.583 Min. :1.583 Min. :1.250
## 1st Qu.:3.333 1st Qu.:2.417 1st Qu.:2.625
## Median :3.667 Median :2.833 Median :3.188
## Mean :3.680 Mean :2.787 Mean :3.121
## 3rd Qu.:4.083 3rd Qu.:3.167 3rd Qu.:3.625
## Max. :4.917 Max. :4.333 Max. :5.000
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
p <- ggpairs(lr2, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
The score variables deep, surf and stra are distributed between 1 and 5.
Correlations in the data:
points and attitude are strongly correlated
points and stra are also correlated; likewise points and surf (negative correlation)
age skews young and it isn’t strongly correlated with anything.
Linear regression is used to determine the relation between a response variable and one or more (in the case of multiple regression) explanatory variables. For a single explanatory variable, the equation to solve is:
\(Y = \alpha + \beta X + \epsilon\)
where all the terms are \(N\)-dimensional vectors (for \(N\) observations). \(Y\) is the response variable and \(X\) is the explanatory variable. The error term \(\epsilon\) is assumed to be normally distributed with a mean of 0.
The task is to find the coefficients \(\alpha\) and \(\beta\) that minimize the sum \(\sum_{i=1}^{N} \epsilon_i^2\). This is called the least squares method.
For multiple linear regression the equation is similarly \(Y = \alpha + X\beta + \epsilon\), where \(X\) is an \(N \times k\) matrix (for \(k\) explanatory variables) instead of an \(N \times 1\) vector.
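As a sketch of what the least squares method does under the hood, the coefficients can be computed directly from the normal equations and compared against lm(); the simulated data here is purely illustrative:

```r
set.seed(1)
x <- runif(50)
y <- 2 + 3 * x + rnorm(50, sd = 0.1)   # true alpha = 2, beta = 3

# Design matrix with an explicit intercept column
X <- cbind(1, x)

# Normal equations: (X'X) beta = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# lm() solves the same minimization (internally via QR decomposition)
fit <- lm(y ~ x)
all.equal(as.vector(beta_hat), unname(coef(fit)))   # TRUE
```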
Linear regression model for variable points based on three explanatory variables: attitude, stra and surf:
model <- lm(points ~ attitude + stra + surf, data = lr2)
summary(model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lr2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The overall fit is assessed with the omnibus F-test, which tests the null hypothesis that the coefficients of all three explanatory variables are zero. The very low p-value of the F-statistic (3.156e-08) means that this hypothesis is unlikely to be true.
The p-value for attitude (well below 0.05) shows that its coefficient is statistically significant; the coefficients for stra and surf are not.
The multiple R-squared value of 0.2074 means that the three variables account for about 21 % of the variation in points.
# Three diagnostic plots in the same figure
par(mfrow = c(2,2))
plot(model, which = c(1,2,5))
Above are three diagnostic plots of the residuals, where a residual is the difference between the fitted and the actual value.
The first plot (residuals vs fitted) is symmetrical around zero and shows no pattern between the residuals and the fitted values, so a linear regression model is appropriate here.
The second plot (normal Q-Q) plots the residual quantiles against theoretical normal quantiles. The points fall roughly on a straight line, so the normality assumption also holds.
The last plot (residuals vs leverage) uses standardized residuals, which have identical variance. No single observation has outsized influence, and the result is similar to plot 1.
date()
## [1] "Mon Nov 27 17:21:02 2023"
In the previous section we covered linear regression. Here we consider some cases where its use may be inappropriate. In particular, linear regression relies on the response variable being normally distributed; the response might not even be continuous. In the case of a binary response variable we would prefer to use logistic regression:
\(\log(\frac{p}{1-p}) = \alpha + \beta X\)
where \(p\) is the probability of one outcome and \(1-p\) the probability of the other, with \(\beta\) and \(X\) being the coefficient and explanatory-variable vectors, combined as a dot product.
Here \(\frac{p}{1-p}\) is the odds of one outcome against the other (the usefulness of odds over probabilities comes from the fact that on the log-odds scale the fitted relationship is a straight line).
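A quick sketch of the logit transformation, using base R's built-in qlogis() (the logit) and plogis() (its inverse):

```r
# The log-odds (logit) maps probabilities in (0, 1) to the whole real line
p <- c(0.1, 0.5, 0.9)
logit <- log(p / (1 - p))
logit                        # symmetric around 0; p = 0.5 maps to 0

# qlogis() computes the same logit; plogis() inverts it
all.equal(logit, qlogis(p))  # TRUE
all.equal(p, plogis(logit))  # TRUE
```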
The data set for analysis comes from studies about Portuguese students’ alcohol consumption:
http://www.archive.ics.uci.edu/dataset/320/student+performance
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
alc <- read.table('data/alc_data.csv', sep=',', header = T)
glimpse(alc)
## Rows: 370
## Columns: 35
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
## $ famsize <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
## $ Fjob <chr> "teacher", "other", "other", "services", "other", "other", …
## $ reason <chr> "course", "course", "other", "home", "home", "reputation", …
## $ guardian <chr> "mother", "father", "mother", "mother", "father", "mother",…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
## $ romantic <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ paid <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes", …
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5,…
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16,…
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16…
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 1…
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0,…
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
The data has been combined from the mathematics and Portuguese classes.
The data has a number of variables related to grading (G1, G2 and G3), absences, extra classes (paid) and previous failures (failures). It also has information about alcohol usage, including a boolean column for high alcohol usage (high_use). The rest are a variety of either numerical or categorical metadata columns.
The objective is to analyze the relation of alcohol consumption to the various variables. The hypothesis is that high alcohol consumption will lead to
lower grades (especially G3)
more absences
Such students will also have spent less time studying (studytime) and will not have taken paid extra classes.
library(tidyr); library(dplyr); library(ggplot2)
# install.packages("patchwork")
library(patchwork)
# initialize a plot of high_use and G3
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))
g1 <- g1 + geom_boxplot() + ylab("final grade")
g11 <- ggplot(alc, aes(x = sex, y = G3, col = high_use))
g11 <- g11 + geom_boxplot() + ylab("final grade")
g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))
g2 <- g2 + geom_boxplot() + ylab("absences")
g3 <- ggplot(alc, aes(x = studytime, y = high_use))
g3 <- g3 + geom_col() + ylab("high use")
g4 <- ggplot(alc, aes(x = paid, y = high_use))
g4 <- g4 + geom_boxplot() + ylab("high use")
g1 + g11 + g2 + g3
Grades are lower and there are more absences with high alcohol usage, although the effect is more pronounced for men than women. Students with high alcohol consumption spent a lot less time on studying.
gh <- ggplot(alc, aes(x = studytime))
gh <- gh + geom_histogram()
gh2 <- ggplot(alc, aes(x = studytime))
gh2 <- gh2 + geom_histogram()
gh + gh2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
table(high_use = alc$high_use, paid = alc$paid)
## paid
## high_use no yes
## FALSE 140 119
## TRUE 56 55
Those with low alcohol consumption used fewer paid classes, although the difference is not dramatic. My hypothesis was thus incorrect; it’s possible that students who use a lot of alcohol need the extra classes (perhaps to compensate for lack of study time?).
m <- glm(high_use ~ G3 + absences + studytime + paid, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ G3 + absences + studytime + paid, family = "binomial",
## data = alc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.44301 0.51534 0.860 0.389984
## G3 -0.06369 0.03725 -1.710 0.087317 .
## absences 0.07688 0.02285 3.365 0.000766 ***
## studytime -0.58041 0.16270 -3.567 0.000361 ***
## paidyes 0.40796 0.24619 1.657 0.097501 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 414.63 on 365 degrees of freedom
## AIC: 424.63
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept) G3 absences studytime paidyes
## 0.44300723 -0.06368689 0.07687909 -0.58041234 0.40795610
The low p-values for absences and studytime indicate a statistically significant association with high alcohol use. Surprisingly, grades (G3) do not.
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 1.5573836 0.5663598 4.305756
## G3 0.9382987 0.8717178 1.009274
## absences 1.0799115 1.0350021 1.132305
## studytime 0.5596675 0.4024071 0.762707
## paidyes 1.5037411 0.9304002 2.446804
For grades (G3) and paid classes, 1.0 is within the confidence interval, thus there is no evidence that these variables are associated with high alcohol usage.
m2 <- glm(high_use ~ absences + studytime, data = alc, family = "binomial")
probabilities <- predict(m2, type = "response")
# add probability
alc2 <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc2 <- mutate(alc2, prediction = probability > 0.5)
table(high_use = alc2$high_use, prediction = alc2$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 248 11
## TRUE 93 18
# tabulate the target variable versus the predictions
table(high_use = alc2$high_use, prediction = alc2$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67027027 0.02972973 0.70000000
## TRUE 0.25135135 0.04864865 0.30000000
## Sum 0.92162162 0.07837838 1.00000000
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc2$high_use, prob = alc2$prediction)
## [1] 0.2810811
The proportion of incorrect predictions in the training data is 28.1 %.
table(high_use = alc2$high_use, high_absences = alc2$absences > 1) %>% prop.table() %>% addmargins()
## high_absences
## high_use FALSE TRUE Sum
## FALSE 0.23513514 0.46486486 0.70000000
## TRUE 0.07027027 0.22972973 0.30000000
## Sum 0.30540541 0.69459459 1.00000000
table(high_use = alc2$high_use, low_study = alc2$studytime < 2) %>% prop.table() %>% addmargins()
## low_study
## high_use FALSE TRUE Sum
## FALSE 0.5486486 0.1513514 0.7000000
## TRUE 0.1864865 0.1135135 0.3000000
## Sum 0.7351351 0.2648649 1.0000000
table(high_use = alc2$high_use, low_study_high_absence = alc2$studytime < 2 | alc2$absences > 1) %>% prop.table() %>% addmargins()
## low_study_high_absence
## high_use FALSE TRUE Sum
## FALSE 0.19189189 0.50810811 0.70000000
## TRUE 0.05405405 0.24594595 0.30000000
## Sum 0.24594595 0.75405405 1.00000000
Guessing based on simple metrics, such as low study time or high absences, does not produce good predictions.
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc2, cost = loss_func, glmfit = m2, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2945946
The 10-fold cross-validation error (29.5 %) is slightly higher than the training error, which is expected, since cross-validation estimates performance on data the model was not fitted to.
date()
## [1] "Mon Nov 27 17:21:03 2023"
In multivariate analysis several variables are analysed at the same time. The difference from analysing a single response variable is that in multivariate analysis you do not usually "explain" one variable in terms of the others, but rather (often) explore connections between the variables.
It is useful to graphically explore the relations between the variables, for example using boxplots and scatterplots. Normality of the data can be assessed with quantile plots.
The purpose of clustering is to group items into mutually exclusive groups. The members of a single group should be closer to each other than to members of other groups.
A common metric for determining the difference between items is the Euclidean distance:
\(d_{ij} = \sqrt{\sum_{k=1}^{q}(x_{ik}-x_{jk})^2}\)
Another option is the Manhattan distance, which assumes "block-wise" traversal, i.e. you can only move east/west or north/south along the streets.
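A minimal two-point sketch of the difference between the two metrics, using base R's dist():

```r
# Two points: (0, 0) and (3, 4)
pts <- rbind(c(0, 0), c(3, 4))

d_eu  <- dist(pts)                        # Euclidean: sqrt(3^2 + 4^2) = 5
d_man <- dist(pts, method = "manhattan")  # Manhattan: |3| + |4| = 7
c(d_eu, d_man)                            # 5 7
```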
The distance metrics are obviously sensitive to different scales; for it to work the data may have to be scaled around the mean and normalized according to standard deviation:
\(scaled(x) = \frac{x - mean(x)}{ sd(x)}\)
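This is exactly what base R's scale() does for each column; a quick check on a toy vector:

```r
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)

# scale() returns a matrix; drop the dimensions for comparison
all.equal(manual, as.vector(scale(x)))   # TRUE
```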
In hierarchical clustering each item is iteratively connected to their closest sub-clusters (either single items or clusters previously defined based on them). There are three common ways to measure the distance between clusters:
Single linkage
Maximum linkage
Average linkage
In single linkage, the distance between clusters is the smallest distance between any members of the respective clusters. Maximum and average linkage are based on the maximum and average distance, respectively. Hierarchical clustering produces a tree-like dendrogram.
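In base R the three linkage criteria correspond to the `method` argument of hclust(), where "complete" is maximum linkage. A minimal sketch on simulated data:

```r
set.seed(42)
m <- matrix(rnorm(40), ncol = 2)   # 20 points in 2 dimensions
d <- dist(m)                       # Euclidean distance matrix

hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")  # maximum linkage
hc_average  <- hclust(d, method = "average")

# Cut the complete-linkage dendrogram into 3 clusters
clusters <- cutree(hc_complete, k = 3)
table(clusters)
```

Calling plot() on any of the hclust objects would draw the dendrogram.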
The k-means algorithm seeks to group items into \(k\) clusters based on some criterion, typically within-cluster distances. The issue with this method is that for a large amount of data, calculating the overall distances may be computationally infeasible. An alternative is to choose some initial clustering and then iteratively make small changes to it, keeping only those changes that improve whatever criterion is used to measure "best fit".
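A sketch of k-means in base R, including the common "elbow" heuristic of tracking the total within-cluster sum of squares (WCSS) over different values of \(k\); the simulated data is purely illustrative:

```r
set.seed(123)   # k-means starts from random centers
m <- matrix(rnorm(100), ncol = 2)   # 50 points in 2 dimensions

# WCSS for k = 1..5; look for an "elbow" where it stops dropping quickly.
# nstart = 10 reruns the algorithm from 10 random starts and keeps the best.
wcss <- sapply(1:5, function(k)
  kmeans(m, centers = k, nstart = 10)$tot.withinss)
wcss

km <- kmeans(m, centers = 2, nstart = 10)
table(km$cluster)   # cluster sizes
```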
We are analysing Boston housing market data and its relations to various variables. The details about the data set are found here:
https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html
##
## Attaching package: 'MASS'
## The following object is masked from 'package:patchwork':
##
## area
## The following object is masked from 'package:dplyr':
##
## select
## corrplot 0.92 loaded
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
In addition to median house price, the variables include: crime rate; taxes; ratio of black population; access to education, employment and transportation, among other things.
Renaming the variables to be more descriptive:
names(Boston) = c("crime", "zone.pct", "ind.pct",
"river", "nox.ppm",
"rooms.avg", "age.pct",
"dist", "roads.idx", "tax",
"pupil.ratio", "black",
"stat.pct", "price")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crime : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zone.pct : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ ind.pct : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ river : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox.ppm : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rooms.avg : num 6.58 6.42 7.18 7 7.15 ...
## $ age.pct : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dist : num 4.09 4.97 4.97 6.06 6.06 ...
## $ roads.idx : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ pupil.ratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ stat.pct : num 4.98 9.14 4.03 2.94 5.33 ...
## $ price : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
head(Boston)
## crime zone.pct ind.pct river nox.ppm rooms.avg age.pct dist roads.idx tax
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222
## pupil.ratio black stat.pct price
## 1 15.3 396.90 4.98 24.0
## 2 17.8 396.90 9.14 21.6
## 3 17.8 392.83 4.03 34.7
## 4 18.7 394.63 2.94 33.4
## 5 18.7 396.90 5.33 36.2
## 6 18.7 394.12 5.21 28.7
The ranges of the variables vary; some are percentages, some are ratios, and the rest take a variety of positive values (mostly above 1).
# round to two decimals; round() without digits would truncate to 0/1
cor_matrix <- cor(Boston) %>% round(digits = 2)
corrplot(cor_matrix,
method="circle", type = "upper",
cl.pos = "b",
tl.pos = "d", tl.cex = 0.6)
The correlation matrix shows positive correlations between crime and taxation and road access. The house price is positively correlated with average number of rooms (there is no variable for house size itself) and negatively correlated with pupil-teacher ratio (i.e. fewer teachers per pupil) and lower status of the population.
Access to river doesn’t seem to be correlated with anything.
To be able to properly analyze the data, we need to scale it.
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- as.numeric(boston_scaled$crime)
summary(boston_scaled$crime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
We create a categorical variable for the crime rates.
bins <- quantile(boston_scaled$crime)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crime,
breaks = bins,
include.lowest = TRUE,
labels = c("low", "med_low", "med_high", "high")
)
boston_scaled <- dplyr::select(boston_scaled, -crime)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
head(boston_scaled)
## zone.pct ind.pct river nox.ppm rooms.avg age.pct dist
## 1 0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948 0.140075
## 2 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824 0.3668034 0.556609
## 3 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490 0.556609
## 4 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878 1.076671
## 5 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743 1.076671
## 6 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100 1.076671
## roads.idx tax pupil.ratio black stat.pct price crime
## 1 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990 0.1595278 low
## 2 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239 low
## 3 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324 1.3229375 low
## 4 -0.7521778 -1.1050216 0.1129203 0.4157514 -1.3601708 1.1815886 low
## 5 -0.7521778 -1.1050216 0.1129203 0.4406159 -1.0254866 1.4860323 low
## 6 -0.7521778 -1.1050216 0.1129203 0.4101651 -1.0422909 0.6705582 low
The newly scaled variables have a mean of zero.
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
LDA is used to reduce the dimensionality of the data. Here we want to see which variables are most relevant to crime rates.
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2500000 0.2623762 0.2376238 0.2500000
##
## Group means:
## zone.pct ind.pct river nox.ppm rooms.avg age.pct
## low 1.0130351 -0.8843534 -0.11640431 -0.8877748 0.4016937 -0.8892838
## med_low -0.1223800 -0.2803620 0.02481057 -0.5439028 -0.0970797 -0.3520165
## med_high -0.3693281 0.2338000 0.21980846 0.4504817 0.1221638 0.3725393
## high -0.4872402 1.0171306 -0.03844192 1.0728110 -0.3669600 0.7969423
## dist roads.idx tax pupil.ratio black stat.pct
## low 0.9016778 -0.6919117 -0.7678746 -0.3750530 0.3787090 -0.77788777
## med_low 0.3032925 -0.5549882 -0.4764159 -0.0849146 0.3253694 -0.16423052
## med_high -0.4202793 -0.4207972 -0.3056799 -0.4072044 0.1008959 -0.01456376
## high -0.8442123 1.6379981 1.5139626 0.7806252 -0.8084572 0.94303396
## price
## low 0.50261900
## med_low 0.04269446
## med_high 0.23914977
## high -0.75208073
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zone.pct 0.10508373 0.76695007 -0.996090748
## ind.pct 0.09060594 -0.21519401 0.039314143
## river -0.08983175 -0.04335498 0.140253245
## nox.ppm 0.34706647 -0.77212168 -1.348407409
## rooms.avg -0.08931250 -0.07466255 0.009250446
## age.pct 0.23625387 -0.15063742 -0.167613813
## dist -0.06004226 -0.20979392 0.114391468
## roads.idx 3.47749262 1.03091451 -0.289407677
## tax -0.06994686 -0.18498948 0.836434126
## pupil.ratio 0.11359242 0.06407304 -0.235309144
## black -0.12366816 0.04762272 0.116339664
## stat.pct 0.20591197 -0.27736956 0.496760751
## price 0.18442180 -0.51471745 -0.213019111
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9505 0.0369 0.0126
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
The first linear discriminant explains 95 % of the between-group variance (the proportion of trace). Its highest coefficient is for roads.idx, that is, the index of accessibility to radial highways, followed by nitrogen oxide pollution, the percentage of old houses and the lower-status share of the population.
# target classes as numeric
classes <- as.numeric(train$crime)
plot(lda.fit,
dimen = 2,
col = classes,
pch = classes)
lda.arrows(lda.fit, myscale = 2)
The plot confirms that road access is the variable most relevant to the crime-rate categories.
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 12 12 2 0
## med_low 5 13 2 0
## med_high 0 16 12 2
## high 0 0 0 26
table(correct = correct_classes, predicted = lda.pred$class) %>% prop.table() %>% addmargins()
## predicted
## correct low med_low med_high high Sum
## low 0.11764706 0.11764706 0.01960784 0.00000000 0.25490196
## med_low 0.04901961 0.12745098 0.01960784 0.00000000 0.19607843
## med_high 0.00000000 0.15686275 0.11764706 0.01960784 0.29411765
## high 0.00000000 0.00000000 0.00000000 0.25490196 0.25490196
## Sum 0.16666667 0.40196078 0.15686275 0.27450980 1.00000000
The method correctly classifies about 62 % of the items:
sum(correct_classes == lda.pred$class) / length(correct_classes)
## [1] 0.6176471
# center and standardize variables
boston_scaled2 <- scale(Boston)
# use colnames(); names() does not rename the columns of a matrix
colnames(boston_scaled2) = c("crime", "zone.pct", "ind.pct",
"river", "nox.ppm",
"rooms.avg", "age.pct",
"dist", "roads.idx", "tax",
"pupil.ratio", "black",
"stat.pct", "price")
# change the object to data frame
boston_scaled2 = as.data.frame(boston_scaled2)
# euclidean distance matrix
dist_eu <- dist(boston_scaled2)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled2, method = "manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
Above are summaries of the Euclidean and Manhattan distance matrices for the scaled dataset; the Manhattan distances are larger, as expected.
km <- kmeans(boston_scaled2, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)
km2 <- kmeans(Boston, centers = 3)
# plot the Boston dataset with clusters
pairs(Boston, col = km2$cluster)
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1,
y = matrix_product$LD2,
z = matrix_product$LD3,
type= 'scatter3d', mode='markers',
color=train$crime,
)
Some clustering is shown based on the crime categories.
(more chapters to be added similarly as we proceed with the course!)